Current Issue: April-June 2021, Volume 2021, Issue 2 (6 Articles)
Most binaural speech source localization models perform poorly in severely noisy and reverberant conditions. Here, this issue is addressed by modelling a multiscale dilated convolutional neural network (CNN). The time-related cross-correlation function (CCF) and the energy-related interaural level differences (ILD) are preprocessed in separate branches of the dilated convolutional network, so that the multiscale dilated CNN encodes discriminative representations for the CCF and the ILD, respectively. After encoding, the individual interaural representations are fused to map the source direction. Furthermore, to improve parameter adaptation, a novel semiadaptive entropy is proposed to train the network under directional constraints. Experimental results show that the proposed method can adaptively locate speech sources in simulated noisy and reverberant environments....
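The two-branch idea above can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration, not the paper's architecture: the fixed smoothing kernel, the dilation rates (1, 2, 4), the cue lengths, and the 37-direction output grid are all invented for the example.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """Valid 1-D convolution of x with a kernel applied at the given dilation."""
    taps = np.arange(len(kernel)) * dilation
    out_len = len(x) - taps[-1]
    return np.array([x[i + taps] @ kernel for i in range(out_len)])

def branch(x, dilations):
    """One cue branch: stacked dilated convolutions with ReLU nonlinearities."""
    kernel = np.array([0.25, 0.5, 0.25])       # illustrative fixed kernel
    for d in dilations:
        x = np.maximum(dilated_conv1d(x, kernel, d), 0.0)
    return x

rng = np.random.default_rng(0)
ccf = rng.standard_normal(64)                  # toy cross-correlation cue
ild = rng.standard_normal(64)                  # toy interaural level difference cue

# Encode each cue in its own multiscale branch, then fuse and map to directions.
fused = np.concatenate([branch(ccf, [1, 2, 4]), branch(ild, [1, 2, 4])])
W_out = rng.standard_normal((37, fused.size))  # 37 candidate azimuths (assumed grid)
direction = int(np.argmax(W_out @ fused))
```

Growing dilations let later layers see a wider span of the cue without extra parameters, which is the multiscale property the abstract relies on; in a trained model the kernels and the output projection would of course be learned.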
With the innovation and development of network technology, people's needs are steadily growing. Among multimedia forms, music has characteristics that set it apart: it can carry many human emotions, and people express both shallow and deep feelings through it. The study of music emotion in the Internet context is therefore an area of considerable public interest. Against the background of new Internet media, and building on current music emotion models, this paper establishes a music emotion model different from previous ones through systematic research and analysis. Feature vectors are extracted from musical characteristics to build samples, and the samples are screened using network media technology to construct the music emotion model. Simulation results show that the music emotion model established in this paper for the blockchain network environment has high applicability and efficiency....
Speech recognition consists of converting input sound into a sequence of phonemes and then finding text for the input using language models. Phoneme classification performance is therefore a critical factor in the successful implementation of a speech recognition system. However, correctly distinguishing phonemes with similar characteristics remains a challenging problem even for state-of-the-art classification methods, and classification errors are difficult to recover in the subsequent language processing steps. This paper proposes a hierarchical phoneme clustering method that applies recognition models better suited to different phonemes. The phonemes of the TIMIT database are carefully analyzed using a confusion matrix from a baseline speech recognition model. Using the automatic phoneme clustering results, a set of phoneme classification models optimized for the generated phoneme groups is constructed and integrated into a hierarchical phoneme classification method. In a number of phoneme classification experiments, the proposed hierarchical phoneme group models improved performance over the baseline by 3%, 2.1%, 6.0%, and 2.2% for fricative, affricate, stop, and nasal sounds, respectively. Average accuracy was 69.5% for the baseline and 71.7% for the proposed hierarchical models, a 2.2% overall improvement....
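One simple way to derive phoneme groups from a confusion matrix, as the abstract describes, is to merge phonemes whose mutual confusion rate is high. The sketch below is a hedged illustration with an invented five-phoneme toy matrix and an arbitrary 0.1 threshold; the paper's actual clustering procedure and phoneme inventory may differ.

```python
import numpy as np

def confusion_groups(conf, labels, threshold=0.1):
    """Greedily merge phonemes whose mutual confusion rate exceeds the threshold."""
    n = len(labels)
    parent = list(range(n))

    def find(i):                               # union-find root with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    rate = conf / conf.sum(axis=1, keepdims=True)  # row-normalised confusion rates
    for i in range(n):
        for j in range(i + 1, n):
            if rate[i, j] + rate[j, i] > threshold:
                parent[find(i)] = find(j)          # union the two phonemes
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(find(i), []).append(lab)
    return list(groups.values())

# Toy confusion counts: /s/-/z/ and /m/-/n/ are often mistaken for each other.
labels = ["s", "z", "sh", "m", "n"]
conf = np.array([[80, 15,  5,  0,  0],
                 [12, 83,  5,  0,  0],
                 [ 3,  3, 94,  0,  0],
                 [ 0,  0,  0, 85, 15],
                 [ 0,  0,  0, 14, 86]], dtype=float)
groups = confusion_groups(conf, labels)        # [['s','z'], ['sh'], ['m','n']]
```

A hierarchical classifier would then first decide the group and afterwards pick the phoneme within it with a group-specialized model, which is the structure the abstract's two-stage method exploits.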
In this paper, we propose to incorporate local attention in WaveNet-CTC to improve the performance of Tibetan speech recognition in multitask learning. As the number of tasks increases, for example simultaneous Tibetan speech content recognition, dialect identification, and speaker recognition, the speech recognition accuracy of a single WaveNet-CTC decreases. Inspired by the attention mechanism, we introduce local attention to automatically tune the weights of feature frames in a window and pay different attention to context information for multitask learning. The experimental results show that our method improves speech recognition accuracy for all Tibetan dialects in three-task learning compared with the baseline model. Furthermore, our method significantly improves accuracy for the low-resource dialect, by 5.11% against the dialect-specific model....
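The core operation, softmax-weighting the feature frames inside a window, can be sketched as follows. This is a generic dot-product local attention with an invented window size and toy frames, not the paper's exact WaveNet-CTC integration.

```python
import numpy as np

def local_attention(frames, query, t, window=3):
    """Softmax-weight the feature frames in a window centred on frame t."""
    lo, hi = max(0, t - window), min(len(frames), t + window + 1)
    ctx = frames[lo:hi]                        # (w, d) local context frames
    scores = ctx @ query                       # dot-product relevance scores
    w = np.exp(scores - scores.max())          # numerically stable softmax
    w /= w.sum()
    return w @ ctx, w                          # attended feature, attention weights

rng = np.random.default_rng(0)
frames = rng.standard_normal((10, 4))          # 10 toy feature frames of dim 4
attended, weights = local_attention(frames, frames[5], t=5)
```

Restricting attention to a window keeps the cost linear in sequence length while still letting each frame emphasize its most relevant neighbours, which is what lets the shared encoder serve several tasks at once.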
This paper addresses the problems of low accuracy, long recognition time, and low recognition efficiency in English speech recognition. To improve accuracy and efficiency, an improved ant colony algorithm is used to solve the dynamic time warping problem. Its core is an adaptive volatilization coefficient and a dynamic pheromone update strategy applied to the basic ant colony algorithm. With new state transition rules, optimal ant parameter selection, and other improvements, the best path can be found in a shorter time and execution efficiency improved. Simulation experiments compared the recognition rates of the traditional and the improved ant colony algorithms. The results show that the global search ability and accuracy of the improved ant colony algorithm exceed those of the traditional algorithm, which can effectively improve the efficiency of an English speech recognition system....
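As a reference point, the alignment cost that such a warping-path search targets is the standard dynamic-time-warping recurrence, sketched below together with an illustrative adaptive evaporation schedule. The schedule's constants are invented; the paper's actual volatilization rule and transition rules are not specified in the abstract.

```python
import numpy as np

def dtw_cost(a, b):
    """Standard dynamic-time-warping alignment cost between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])       # local distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def adaptive_rho(it, rho0=0.9, decay=0.95, rho_min=0.1):
    """Illustrative adaptive volatilization: evaporation shrinks over iterations,
    shifting the colony from exploration toward exploitation."""
    return max(rho_min, rho0 * decay ** it)
```

An ant colony variant treats cells of the D grid as graph nodes and deposits pheromone along low-cost warping paths; the dynamic programming version above is the exact baseline it approximates on large grids.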
The most prominent form of human communication and interaction is speech. It plays an indispensable role in expressing emotions, motivating, guiding, and cheering. An ill-intentioned speech can mislead people, societies, and even a nation; a misguided speech can trigger social controversy and result in violent activities. Every day, many speeches are delivered around the world, far too many to inspect manually. To prevent vicious actions resulting from misguided speeches, the development of an automatic system that can efficiently detect suspicious speech has become imperative. In this study, we present a framework for acquiring speech along with the speaker's location and converting the speeches into text, and we propose a system based on long short-term memory (LSTM), a variant of the recurrent neural network (RNN), to classify speeches as suspicious or nonsuspicious. We consider speeches in the Bangla language and have developed our own dataset containing about 5000 suspicious and nonsuspicious samples for training and validating the model. A comparative analysis of accuracy against other machine learning algorithms such as logistic regression, SVM, KNN, Naive Bayes, and decision tree is performed to evaluate the effectiveness of the system. The experimental results show that our proposed deep learning-based model provides the highest accuracy compared with the other algorithms....
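The binary LSTM classifier the abstract describes can be sketched from first principles in NumPy. The weights, embedding size, and hidden size below are random toy values standing in for a trained model; in practice the weights would be learned on the labeled suspicious/nonsuspicious transcripts.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; the four gates are slices of a single stacked projection."""
    d = h.size
    z = W @ x + U @ h + b                      # (4d,) stacked gate pre-activations
    i = sigmoid(z[:d])                         # input gate
    f = sigmoid(z[d:2 * d])                    # forget gate
    o = sigmoid(z[2 * d:3 * d])                # output gate
    g = np.tanh(z[3 * d:])                     # candidate cell state
    c = f * c + i * g                          # update long-term memory
    h = o * np.tanh(c)                         # expose gated hidden state
    return h, c

def classify(seq, W, U, b, w_out):
    """Run the LSTM over a token sequence, then a logistic read-out on the final state."""
    d = w_out.size
    h, c = np.zeros(d), np.zeros(d)
    for x in seq:
        h, c = lstm_step(x, h, c, W, U, b)
    return sigmoid(w_out @ h)                  # P(suspicious)

rng = np.random.default_rng(0)
emb, d = 16, 8                                 # toy embedding and hidden sizes
W = rng.standard_normal((4 * d, emb))
U = rng.standard_normal((4 * d, d))
seq = rng.standard_normal((5, emb))            # a 5-token "utterance" of embeddings
p = classify(seq, W, U, np.zeros(4 * d), rng.standard_normal(d))
```

The gating is what lets the model carry context across a whole sentence, which is why the abstract's LSTM outperforms the bag-of-words style baselines (logistic regression, SVM, KNN, Naive Bayes, decision tree) that ignore word order.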